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Abstract 



Accelerated algorithms for maximum likelihood image reconstruction are essen- 
tial for emerging applications such as 3D tomography, dynamic tomographic imag- 
ing, and other high dimensional inverse problems. In this paper, we introduce and 
analyze a class of fast and stable sequential optimization methods for computing 
maximum likelihood estimates and study its convergence properties. These methods 
are based on a proximal point algorithm implemented with the Kullback-Liebler (KL) 
divergence between posterior densities of the complete data as a proximal penalty 
function. When the proximal relaxation parameter is set to unity one obtains the 
classical expectation maximization (EM) algorithm. For a decreasing sequence of re- 
laxation parameters, relaxed versions of EM are obtained which can have much faster 
asymptotic convergence without sacrifice of monotonicity. We present an implemen- 
tation of the algorithm using More's Trust Region update strategy. For illustration 
the method is applied to a non-quadratic inverse problem with Poisson distributed 
data. 
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1 Introduction 



Maximum likelihood (ML) or maximum penalized likelihood (MPL) approaches have 
been widely adopted for image restoration and image reconstruction from noise con- 
taminated data with known statistical distribution. In many cases the likelihood 
function is in a form for which analytical solution is difficult or impossible. When 
this is the case iterative solutions to the ML reconstruction or restoration problem 
are of interest. Among the most stable iterative strategies for ML is the popular 
expectation maximization (EM) algorithm [8J. The EM algorithm has been widely 
applied to emission and transmission computed tomography [39, 231 [36] with Poisson 
data. The EM algorithm has the attractive property of monotonicity which guar- 
antees that the likelihood function increases with each iteration. The convergence 
properties of the EM algorithm and its variants have been extensively studied in the 
literature; see [42] and [15] for instance. It is well known that under strong concavity 
assumptions the EM algorithm converges linearly towards the ML estimator 9ml- 
However, the rate coefficient is small and in practice the EM algorithm suffers from 
slow convergence in late iterations. Efforts to improve on the asymptotic convergence 
rate of the EM algorithm have included: Aitken's acceleration [28], over-relaxation 
[26] . conjugate gradient [20] [T§], Newton methods [30] [3], quasi-Newton methods 
[22], ordered subsets EM fTT] and stochastic EM [25]. Unfortunately, these methods 
do not automatically guarantee the monotone increasing likelihood property as does 
standard EM. Furthermore, many of these accelerated algorithms require additional 
monitoring for instability [24j. This is especially problematic for high dimensional 
image reconstruction problems, e.g. 3D or dynamic imaging, where monitoring could 
add significant computational overhead to the reconstruction algorithm. 

The contribution of this paper is the introduction of a class of accelerated EM 
algorithms for likelihood function maximization via exploitation of a general relation 
between EM and proximal point (PP) algorithms. These algorithms converge and 
can have quadratic rates of convergence even with approximate updating. Proximal 
point algorithms were introduced by Martinet [29 1 and Rockafellar [38j , based on the 
work of Minty |31] and Moreau [33] , for the purpose of solving convex minimization 
problems with convex constraints. A key motivation for the PP algorithm is that 
by adding a sequence of iteration-dependent penalties, called proximal penalties, to 
the objective function to be maximized one obtains stable iterative algorithms which 
frequently outperform standard optimization methods without proximal penalties, 
e.g. see Goldstein and Russak [lj. Furthermore, the PP algorithm plays a paramount 
role in non-differentiable optimization due to its connections with the Moreau- Yosida 
regularization; see Minty [31) . Moreau [33] . Rockafellar |38j and Hiriart-Hurruty and 
Lemarechal [16J. 

While the original PP algorithm used a simple quadratic penalty more general 
versions of PP have recently been proposed which use non-quadratic penalties, and 
in particular entropic penalties. Such penalties are most commonly applied to ensure 
non-negativity when solving Lagrange duals of inequality constrained primal prob- 
lems; see for example papers by Censor and Zenios [5], Ekstein [10] , Eggermont [5], 
and Teboulle [40 . In this paper we show that by choosing the proximal penalty 
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function of PP as the Kullback-Liebler (KL) divergence between successive iterates 
of the posterior densities of the complete data, a generalization of the generic EM 
maximum likelihood algorithm is obtained with accelerated convergence rate. When 
the relaxation sequence is constant and equal to unity the PP algorithm with KL 
proximal penalty reduces to the standard EM algorithm. On the other hand for a de- 
creasing relaxation sequence the PP algorithm with KL proximal penalty is shown to 
yield an iterative ML algorithm which has much faster convergence than EM without 
sacrificing its monotonic likelihood property. 

It is important to point out that relations between particular EM and particular 
PP algorithms have been previously observed, but not in the full generality estab- 
lished in this paper. Specifically, for parameters constrained to the non-negative 
orthant, Eggermont [9] established a relation between an entropic modification of 
the standard PP algorithm and a class of multiplicative methods for smooth convex 
optimization. The modified PP algorithm that was introduced in [9] was obtained 
by replacing the standard quadratic penalty by the relative entropy between succes- 
sive non-negative parameter iterates. This extension was shown to be equivalent to 
an "implicit" algorithm which, after some approximations to the exact PP objective 
function, reduces to the "explicit" Shepp and Vardi EM algorithm [39J for image 
reconstruction in emission tomography. Eggermont [9] went on to prove that the 
explicit and implicit algorithms are monotonic and both converge when the sequence 
of relaxation parameters is bounded below by a strictly positive number. 

In contrast to [9], here we establish a general and exact relation between the 
generic EM procedure, i.e. arbitrary incomplete and complete data distributions, 
and an extended class of PP algorithms. As pointed out above, the extended PP 
algorithm is implemented with a proximal penalty which is the relative entropy (KL 
divergence) between successive iterates of the posterior densities of the complete data. 
This modification produces a class of algorithms which we refer to as Kullback-Liebler 
proximal point (KPP). We prove a global convergence result for the KPP algorithm 
under strict concavity assumptions. An approximate KPP is also proposed using 
the Trust Region strategy |32t [M] adapted to KPP. We show, in particular, that 
both the exact and approximate KPP algorithms have superlinear convergence rates 
when the sequence of positive relaxation parameters converge to zero. Finally, we 
illustrate these results for KPP acceleration of the Shepp and Vardi EM algorithm 
implemented with Trust Region updating. 

The results given here are also applicable to the non-linear updating methods 
of Kivinen and Warmuth [21] for accelerating the convergence of Gaussian mixture- 
model identification algorithms in supervised machine learning, see also Warmuth 
and Azoury [UJ and Helmbold, Schapire, Singer and Warmuth p3]. Indeed, simi- 
larly to the general KPP algorithm introduced in this paper, in fl4] the KL divergence 
between the new and the old mixture model was added to the gradient of the Gaus- 
sian mixture-model likelihood function, appropriately weighted with a multiplicative 
factor called the learning rate parameter. This procedure led to what the authors of 
|14] called an exponentiated gradient algorithm. These authors provided experimen- 
tal evidence of significant improvements in convergence rate as compared to gradient 
descent and ordinary EM. The results in this paper provide a general theory which 
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validate such experimental results for a very broad class of parametric estimation 
problems. 

The outline of the paper is as follows. In Section [2] we provide a brief review of 
key elements of the classical EM algorithm. In Section [3l we establish the general 
relationship between the EM algorithm and the proximal point algorithm. In section 
HI we present the general KPP algorithm and we establish global and superlinear 
convergence to the maximum likelihood estimator for a smooth and strictly concave 
likelihood function. In section [51 we study second order approximations of the KPP 
iteration using Trust Region updating. Finally, in Section [6] we present numerical 
comparisons for a Poisson inverse problem. 



2 Background 



The problem of maximum likelihood (ML) estimation consists of finding a solution 
of the form 

6ml = argmax eeRP l y (6), (1) 

where y is an observed sample of a random variable Y defined on a sample space y 
and l y (9) is the log- likelihood function defined by 

l y (9) = log g(y; 6), (2) 

and g(y; 9) denotes the density of Y at y parametrized by a vector parameter 9 in 
R p . One of the most popular iterative methods for solving ML estimation problems 
is the Expectation Maximization (EM) algorithm described in Dempster, Laird, and 
Rubin [8] which we recall for the reader. 

A more informative data space X is introduced. A random variable X is defined 
on X with density f(x;9) parametrized by 9. The data X is more informative than 
the actual data Y in the sense that Y is a compression of X, i.e. there exists a 
non-invertible transformation h such that Y = h(X). If one had access to the data 
X it would therefore be advantageous to replace the ML estimation problem (JT]) by 

9 M L = axgmax eeRp l x (9), (3) 

with l x {9) = log /(#;#). Since y = h(x) the density g of Y is related to the density 
/ of X through 

g(y;9)= I f(x;9)d^(x) (4) 
Jh-mv}) 

for an appropriate measure [i on X. In this setting, the data y are called incomplete 
data whereas the data x are called complete data. 

Of course the complete data x corresponding to a given observed sample y are 
unknown. Therefore, the complete data likelihood function l x {9) can only be es- 
timated. Given the observed data y and a previous estimate of 9 denoted 9, the 
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following minimum mean square error estimator (MMSE) of the quantity l x {9) is 
natural 

Q(9,9) = E[logf(x;9)\y;9], 

where, for any integrable function F(x) on X , we have defined the conditional ex- 
pectation 

E[F(x)\y;9]= f F(x)k(x\y;9)dn(x) 
Jh-H{y}) 

and k(x\y;9) is the conditional density function given y 

k(x\y;9) = lp§-. (5) 

The EM algorithm generates a sequence of approximations to the solution ([3]) starting 
from an initial guess 9° of 9ml and is defined by 

Compute Q(9, 9 k ) = E[log f(x; 9)\y; 9 k ] E Step 

gk+i = ar gmax eeRP Q(6>, 9 k ) M Step 



A key to understanding the convergence of the EM algorithm is the decomposi- 
tion of the likelihood function presented in Dempster, Laird and Rubin [8]. As this 
decomposition is also the prime motivation for the KPP generalization of EM it will 
be worthwhile to recall certain elements of their argument. The likelihood can be 
decomposed as 

l y (9) = Q(9,9) + H(9,9) (6) 

where 

H(9,9) = -E[logk(x\y;9)\y;9}. 
It follows from elementary application of Jensen's inequality to the log function that 

H{9, 9) > H(9, 9) > 0, V0, 9 e R p . (7) 



Observe from (J6|) and ([7]) that for any 9 k the 9 function Q(9, 9 k ) is a lower bound 
on the log likelihood function l y {9). This property is sufficient to ensure monotonicity 
of the algorithm. Specifically, since the the M-step implies that 

Q{9 k+ \9 k ) > Q{9 k ,9 k ), (8) 

one obtains 

l y {9 k+l ) - l y {9 k ) > Q{9 k+l ,9 k )-Q{9 k : 9 k ) (9) 
+H(9 k+ \9 k ) - H(9 k ,9 k ). 

Hence, using ([8]) and ([7]) 

ly{9 k+1 )>ly(9 k ). 

This is the well known monotonicity property of the EM algorithm. 



Note that if the function H(9, 9) in ([6]) were scaled by an arbitrary positive factor 
f3 the function Q(9,9) would remain a lower bound on l y (9), the right hand side of 
([9]) would remain positive and monotonicity of the algorithm would be preserved. As 
will be shown below, if f3 is allowed to vary with iteration in a suitable manner one 
obtains a monotone, superlinearly convergent generalization of the EM algorithm. 
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3 Proximal point methods and the EM algo- 
rithm 



In this section, we present the proximal point (PP) algorithm of Rockafellar and 
Martinet. We then demonstrate that EM is a particular case of proximal point 
implemented with a Kullback-type proximal penalty. 



3.1 The proximal point algorithm 

Consider the general problem of maximizing a concave function < 3?(#). The proximal 
point algorithm is an iterative procedure which can be written 

9 k+1 = argmax eeRP - ^\\9 - fc || 2 } . (10) 

The quadratic penalty \\9 — 9 k \\ 2 is relaxed using a sequence of positive parameters 
In |38| . Rockafellar showed that superlinear convergence of this method is 
obtained when the sequence {/3fc} converges towards zero. In numerical implementa- 
tions of proximal point the function 3>(0) is generally replaced by a piecewise linear 
model [16]. 



3.2 Proximal interpretation of the EM algorithm 



In this section, we establish an exact relationship between the generic EM proce- 
dure and an extended proximal point algorithm. For our purposes, we will need to 
consider a particular Kullback-Liebler (KL) information measure. Assume that the 
family of conditional densities {k(x\y; #)}fleRp is regular in the sense of Ibragimov 
and Khasminskii [15] ; in particular k(x\y;9)fi(x) and k(x\y;9)fi(x) are mutually ab- 
solutely continuous for any 9 and 9 in R p . Then the Radon-Nikodym derivative 
jgj-fegj exists for all 9,9 and we can define the following KL divergence: 

/,(**> -en^M]. (ID 



Proposition 1 The EM algorithm is equivalent to the following recursion with fit = 
I, fc=l,2,..., 

9 k+l = argmax eeRp [l y {9) - (3 k I y (9 k ,9)} (12) 

For general positive sequence {Pk} the recursion in Proposition [1] can be identi- 
fied as a modification of the PP algorithm (jlOH with the standard quadratic penalty 
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replaced by the KL penalty (jlip and having relaxation sequence {/3 k}- I n the se- 
quel we call this modified PP algorithm the Kullback-Liebler proximal point (KPP) 
algorithm. In many treatments of the EM algorithm the quantity 

Q(9,9) = l y (9)-l y (9)- 1(9,9) 

is the surrogate function that is maximized in the M-step. This surrogate objec- 
tive function is identical (up to an additive constant) to the KPP objective l y (0) — 
P k I y (9 k ,9) of when/3 fc = 1. 

Proof of Proposition [7J- The key to making the connection with the proximal point 
algorithm is the following representation of the M step: 

k+1 = argmax eeRP {log 5 (y; 6) + E[log jjh^lvi **] }• 

This equation is equivalent to 

9 k+1 = argmax eeRP {log 5 (y;fl) + E[log ^jr^lvi 

since the additional term is constant in 9. Recalling that k(x 

9 k+1 = argmax eeRP {log#(y;fl) + E [log k(x\y; 9)\y; 9 k ] 
-E[\ogk(x\y;9 k )\y;9 k ]}. 

We finally obtain 

9 k+1 = argmax eeRP {log 5 (y; 9) + E [log -^^U ; 9 k ] } 
which concludes the proof. □ 



4 Convergence of the KPP Algorithm 



In this section we establish monotonicity and other convergence properties of the 
KPP algorithm of Proposition [TJ 



4.1 Monotonicity 



For bounded domain of 9, the KPP algorithm is well defined since the maximum 
in (fT2j) is always achieved in a bounded set. Monotonicity is guaranteed by this 
procedure as proved in the following proposition. 



Proposition 2 The log-likelihood sequence {l y (9 k )} is monotone non- decreasing and 
satisfies 

iy(e k+1 )-i y (e k )>f3 k i y (9 k ,9 k+1 ), (is) 

Proof: From the recurrence in f|12|) . we have 

ly(6 k + 1 ) - ly(9 k ) > (3 k I y (9 k , 9^) - fa Iy , d"). 

Since I y (9 k ,9 k ) = and I y {9 k ,9 k+1 ) > 0, we deduce JEJ and that {/ y (0 fc )} is non- 
decreasing. □ 

We next turn to asymptotic convergence of the KPP iterates {9 k }. 

4.2 Asymptotic Convergence 

In the sequel VoiI y (9,9) (respectively Voil y (6,0)) denotes the gradient (respectively 
the Hessian matrix) of I y (9,9) in the first variable. For a square matrix M, Am 
denotes the greatest eigenvalue of a matrix M and Am denotes the smallest. 

We make the following assumptions 

Assumptions 1 We assume the following: 

(i) l y {9) is twice continuously differentiable on~RP and I y (9,9) is twice continuously 
differentiable in (9,9) in R p x R p . 

(ii) lim|i0n_ H 3 Q ly(9) = —oo where \\9\\ is the standard Euclidean norm on R p . 

(Hi) l y (9) < oo and Ay2^(g) < on every bounded 9-set. 

(iv) for any 9 in¥i p , I y {6,9) < oo andO < A V 2 r (6,6) — ^v 2 i ($,6) 071 every bounded 
9-set. 

These assumptions ensure smoothness of l y (9) and I y (9,9) and their first two 
derivatives in 9. Assumption [TJiii also implies strong concavity of l y {9). Assumption 
[TJiv implies that I y (9,9) is strictly convex and that the parameter 9 is strongly 
identifiable in the family of densities k(x\y;9) (see proof of Lemma Q] below) . Note 
that the above assumptions are not the minimum possible set, e.g. that l y (9) and 
I y (9, 9) are upper bounded follows from continuity, Assumption [TJii and the property 
I y (0,6) > Iy(9,9) = 0, respectively. 

We first characterize the fixed points of the KPP algorithm. 

A result that will be used repeatedly in the sequel is that for any 9 € R p 

V ily(9,9) = 0. (14) 
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This follows immediately from the information inequality for the KL divergence 
Thm. 2.6.3] 

I y (9,9)>I y (9,9) = 0, 
so that, by smoothness Assumption [TJi, I y (9,9) has a stationary point at 9 = 9. 



Proposition 3 Let the densities g(y;9) and k(x\y;9) be such that Assumptions 
are satisfied. Then the fixed points of the recurrence in fflfy are maximizers of the 
log-likelihood function l y {9) for any relaxation sequence f5k = f3 > 0, k = 1, 2, . . .. 



Proof: Consider a fixed point 9* of the recurrence relation (|12p for (3^ = (3 = constant. 
Then, 

9* = argmax eeRP {l y {9) - f3I y (9*,9)} . 
As l y {9) and I y (9*, 9) are both smooth in 9, 9* must be a stationary point 

Q = Vly(9*)-pV 01 I y {9*,9*). 

Thus, as by (pU) V 01 I y (9*,9*) = 0, 

= Vl y {9). (15) 
Since l y {9) is strictly concave, we deduce that 9* is a maximizer of l y {9). □ 

The following will be useful. 

Lemma 1 Let the conditional density k(x\y; 9) be such that I y (9, 9) satisfies Assump- 
tionUliv. Then, given two bounded sequences {9\} and {9%}, limfc_ ! , 00 I y (9\, 9%) = 
implies that lim^oo \\9^ — 9^11 = 0. 



Proof: Let B be any bounded set containing both sequences {9\ } and {#2 }- Let A 
denote the minimum 

A = min A V 2 r ( n 0) (16) 

Assumption [TJiv implies that A > 0. Furthermore, invoking Taylor's theorem with 
remainder, I y (9, 9) is strictly convex in the sense that for any k 

I y (9l9 k 2 ) > I y {9 k ll 9 , i)+VI y (9l9^(9 k l - 9\) 

+ X -\\\9\-9\t. 

As I y {9\,0\) = and V 01 I y (9'[,9^) = 0, recall (HH), we obtain 

Iy(0t0 k 2)>^\\0 k 1 -9 k 2 \\ 2 . 

The desired result comes from passing to the limit k —> 00. □ 

Using these results, we easily obtain the following. 
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Lemma 2 Let the densities g(y;9) and k(x\y;9) be such that Assumptions^ are 
satisfied. Then {# fc }fceN is bounded. 

Proof: Due to Proposition [21 the sequence {l y {9 k )} is monotone increasing. There- 
fore, assumption [TJii implies that {9 k } is bounded. □ 

In the following lemma, we prove a result which is often called asymptotic regu- 
larity (2j. 

Lemma 3 Let the densities g(y\9) and k(x\y;6) be such that l y {9) and I y (9,9) 
satisfy Assumptions d Let the sequence of relaxation parameters {/3fc}fceN satisfy 
< lim inf < lim sup < oo . Then, 

hm \\e k+1 -e k \\ = o. (17) 

k— ¥oo 

Proof: By Assumption [TJiii and by Proposition [2] Oy(0 fe )}fceN is bounded and 
monotone. Since, by Lemma [21 {9 k }keN 1S a bounded sequence {ly(0 k )}k£~N con- 
verges. Therefore, lim^oo {l y (9 k+1 ) — l y (9 k )} = which, from (113D . implies that 
PkL y (6 k \9 k+1 ) vanishes when k tends to infinity. Since {/3fc}fc G N is bounded below by 
liminf > 0: lim^oo L y (8 k , 9 k+1 ) = 0. Therefore, Lemma Q] establishes the desired 
result. □ 

We can now give a global convergence theorem. 

Theorem 1 Let the sequence of relaxation parameters {Pk}keN be positive and con- 
verge to a limit (5* £ [0, oo). Then the sequence {9 k }k£N converges to the solution of 
the ML estimation problem |ip. 

Proof: Since {# fc }fc G N is bounded, one can extract a convergent subsequence {H^jfcgN 
with limit 9*. The defining recurrence (|12p implies that 

V^(r« +1 ) - f3 a(k) V ol L y (0^ k \9^ +1 ) = 0. (18) 

We now prove that 9* is a stationary point of l y {9). Assume first that {/3fc}fceN 
converges to zero, i.e. /3* = 0. Due to Assumptions [TJi, Vl y (9) is continuous in 9. 
Hence, since VoiI y {9,9) is bounded on bounded subsets, (fT8j) implies 

Vl y (9*) = 0. 

Next, assume that /3* > 0. In this case, Lemma [3] establishes that 

lim \\9 k+1 -9 k \\ = 0. 

k— ¥oo 
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Therefore, {9 a ^ +1 } keN also tends to 9*. Since S7oiI y (9,9) is continuous in (6,0) 
equation (fl~8l) gives at infinity 

vi y (e*)-p*v 01 i y (e*,e*) = o. 

Finally, by (PU), X7 01 I y (9* ,9*) = and 

Vl y (9*) = 0. (19) 



The proof is concluded as follows. As, by Assumption [TJiii, l y (9) is concave, 6* is 
a maximizer of l y (6) so that 9* solves the Maximum Likelihood estimation problem 
(P). Furthermore, as positive definiteness of V 2 l y implies that l y (9) is in fact strictly 
concave, this maximizer is unique. Hence, {9 k } has only one accumulation point and 
{9 k } converges to 9* which ends the proof. □ 

We now establish the main result concerning speed of convergence. Recall that a 
sequence {9 k } is said to converge superlinearly to a limit 9* if: 

lim = 0, . (20) 



Theorem 2 Assume that the sequence of positive relaxation parameters {/3fc}fceN 
converges to zero. Then, the sequence {9 k } ke ^ converges superlinearly to the solution 
of the ML estimation problem )[7J). 



Proof: Due to TheoremHJ the sequence {9 k } converges to the unique maximizer 9ml 
of l y (9). Assumption [TJi implies that the gradient mapping Ve(l y (9) — (3 k Iy(9ML, 9)) 
is continuously differentiable. Hence, we have the following Taylor expansion about 
9 ml- 

Vl y (9)-p k VoiIy(e M L,9) = Vl y (9 ML ) 
— PkSmlyi^ML-, 9ml) 

+ V%(9 ml )(9-9 M l) (21) 

_ Pk^0lIy{9ML, Gml){G - 9ml) 

+ R(9-9 M l), 

where the remainder satisfies 

i, m iiy-'^'ii-o. 

6-+6ml \\9-9ml\\ 

Since 9ml maximizes l y {9), VI v (9ml) = 0. Furthermore, by (fTij) . VoiI v (9ml, 9ml) = 
0. Hence, (|2ip can be simplified to 

Vl y (9) - hV m I y {9 M L,9) = V%(9 M l)(9 - 9 M l) 

- f3 k V 2 01 I y (9 ML ,9 ML )(9 - 9 ML ) + R{9 - 9 ML ). (22) 
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Prom the defining relation (|12p the iterate 9 k+1 satisfies 

Vl y (0 k+1 ) - (3 k V iIy{6 k , 9 k+1 ) = 0. (23) 

So, taking 9 = 6 k+1 in ((22j) and using j23}, we obtain 
/9*(VoiI w (e fc ,0 fc+1 ) - V 01 I y (8 ML ,e k+1 )) = 
+ V 2 l y (9 ML )(9 k+1 - 9ml) - PkV 2 01 I y (8 ML , 8 ML )(8 k+1 - 8 ML ) 
+ R(9 k+1 -9 ML ). 

Thus, 

\MVoiI y (9 k ,9 k+1 ) -V iIy(9 M L,e k+1 )) - R(9 k+1 - 9 ML )\\ = 
\\V%(9 M L)(9 k+1 - 9ml) - PkV 2 01 I y (8 ML ,8 M L)(e k+1 - 9 M l)\\. (24) 

On the other hand, one deduces from Assumptions CD (i) that VqiI v (Q, 9) is locally 
Lipschitz in the variables 9 and 9. Then, since, {9 k } is bounded, there exists a 
bounded set B containing {9 k } and a finite constant L such that for all 9, 9', 9 and 
9' in B, 

||VoiI„(M) - Voil y {0',9')\\ < L(\\9 - 9'\\ 2 + \\9-9'\\ 2 ) l \ 
Using the triangle inequality and this last result, ()24j) asserts that for any 9 £ B 
p k L\\9 k - 9 ML \\ + \\R(9 k+1 - 9ml)\\ > || (V%(9 ML ) 

- f3 k V 2 1 I y (9 M L,9 M L))(9 k+1 - 9 ML )\\. (25) 

Now, consider again the bounded set B containing {9 k }. Let Xi and Xj denote the 
minima 

Since for any symmetric matrix H, x T Hx/\\x\\ 2 is lower bounded by the minimum 
eigenvalue of H, we have immediately that 

||(-V%(0 M l) + PkVl l Iy{9ML,9 ML ))(9 k+1 - 9 ML )\\ 2 

> (\ + Pk\if\\9 k+l -9 M L\\ 2 . (26) 

By Assumptions [TJiii and [Ijiv, Xi y + (3 k Xi > and, after substitution of ([26]) into 
(l25j) . we obtain 

p k L\\9 k -9 M L\\ + \\R(9 k+1 -9 ML )\\ > 

{Xi y +(3 k Xi)\\9 k+1 -9 ML \\, (27) 

for all 9 € B. Therefore, collecting terms in (I27p 



' * " H^i-^ll J \\9 k -9 M L\\ ■ (28) 



Now, recall that {9 k } is convergent. Thus, lim^oo \\9 — 9ml\\ = and subsequently, 
lim^^oo pg+TZe^| = due to the definition of the remainder i?. Finally, as /3 k 
converges to zero, L is bounded and Xj > 0, equation ([28]) gives ([20]) with 0* = #ml 
and the proof of superlinear convergence is completed. □ 
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5 Second order Approximations and Trust Re- 
gion techniques 



The maximization in the KPP recursion (|12p will not generally yield an explicit exact 
recursion in 9 k and 9 k+l . Thus implementation of the KPP algorithm methods may 
require line search or one-step-late approximations similar to those used for the M- 
step of the non-explicit penalized EM maximum likelihood algorithm [13]. In this 
section, we discuss an alternative which uses second order function approximations 
and preserves the convergence properties of KPP established in the previous section. 
This second order scheme is related to the well-known Trust Region technique for 
iterative optimization introduced by More |32j . 



5.1 Approximate models 

In order to obtain computable iterations, the following second order approximations 
of l y {9) and I y {9 k ,9) are introduced 

i y {9) = i y (e k ) + vi y (e k ) r (e-e k ) + 
l -(e-e k ) J H k {8-e k ). 

and 

i y (9,9 k ) = ±(9-9 k ) T V 2 01 I k (9-9 k ). 

In the following, we adopt the simple notation g k = Vl y (9 k ) (a column vector). A 
natural choice for H k and Jfe is of course 

H k = V%(9 k ) 

and 

i k = v 2 01 i y (e k ,e k ). 

The approximate KPP algorithm is defined as 

9 k+1 = argmax, eRP {/ y (^) + g k {9 - 9 k ) 

+ 1(0 _ e k ) r H k (9 - e k ) (29) 

-tj(9-9 k ) T I k (9-9 k )} 

At this point it is important to make several comments. Notice first that for 
Pk = 0, k = 1, 2, . . ., and H k = V 2 l y (9 k ), the approximate step (|29|) is equivalent 
to a Newton step. It is well known that Newton's method, also known as Fisher 
scoring, has superlinear asymptotic convergence rate but may diverge if not properly 
initialized. Therefore, at least for small values of the relaxation parameter /3 k , the 
approximate PPA algorithm may fail to converge for reasons analogous in Newton's 
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method [37]. On the other hand, for (3 k > the term - ^-(6 - 6 h ) T I k (d - 6 k ) penalizes 
the distance of the next iterate 9 k+l to the current iterate 9 k . Hence, we can interpret 
this term as a regularization or relaxation which stabilizes the possibly divergent 
Newton algorithm without sacrificing its superlinear asymptotic convergence rate. 
By appropriate choice of the iterate 9 k+l can be forced to remain in a region 
around 6 k over which the quadratic model l y (0) is accurate [32j[3j. 

In many cases a quadratic approximation of a single one of the two terms l y {0) or 
I y (9 k ,9) is sufficient to obtain a closed form for the maximum in the KPP recursion 
(|12p . Naturally, when feasible, such a reduced approximation is preferable to the ap- 
proximation of both terms discussed above. For concreteness, in the sequel, although 
our results hold for the reduced approximation also, we only prove convergence for 
the proximal point algorithm implemented with the full two-term approximation. 

Finally, note that (129f) is quadratic in 9 and the minimization problem clearly 
reduces to solving a linear system of equations. For 9 of moderate dimension, these 
equations can be efficiently solved using conjugate gradient techniques [34]. How- 
ever, when the vector 9 in (|29[) is of large dimension, as frequently occurs in inverse 
problems, limited memory BFGS quasi-Newton schemes for updating H k — Pk^k may 
be computationally much more efficient, see for example [M], [35], [27] . |12] and [1 lj . 



5.2 Trust Region Update Strategy 

The Trust Region strategy proceeds as follows. The model l y {9) is maximized in 
a ball B(9 k ,5) = {\\9 — 9 k \\i k < <5} centered at 9 k where 5 is a proximity control 
parameter which may depend on k, and where \\a\\i k = a T Ika is a norm; well defined 
due to positive definiteness of I k (Assumption [TJiv). Given an iterate 9 k consider a 
candidate 9 s for 9 k+1 defined as the solution to the constrained optimization problem 

9 s = &rgmax SGRP iy(9) 

subject to 

\\9 - 9 k \\ Ik < 8. (30) 

By duality theory of constrained optimization [IB], and the fact that l y {9) is strictly 
concave, this problem is equivalent to the unconstrained optimization 

9 5 {P) = argmin eeRP L(0,/3). (31) 

where 

L(9,(3) = -i y (9) + ^(\\9-9 k \\l-5 2 ). 

and (3 is a Lagrange multiplier selected to meet the constraint ([30]) with equality: 
\\9 s ((3)-9\\ Ik =6. 

We conclude that the Trust Region candidate 9 s is identical to the approximate 
KPP iterate (|29p with relaxation parameter (3 chosen according to constraint (|30p . 
This relation also provides a rational rule for computing the relaxation parameter /3. 
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5.3 Implementation 



The parameter 5 is said to be safe if 6 s produces an acceptable increase in the original 
objective l y . An iteration of the Trust Region method consists of two principal steps 

Rule 1. Determine whether 5 is safe or not. If 5 is safe, set 5k = 5 and take 
an approximate Kullback proximal step 9 k+1 = 9 s . Otherwise, take a null step 

Rule 2. Update 5 depending on the result of Rule 1. 

Rule 1 can be implemented by comparing the increase in the original log-likelihood 
l y to a fraction m of the expected increase predicted by the approximate model l y (0). 
Specifically, the Trust Region parameter 5 is accepted if 

ly{9 s )-l y {9 k )>m{i y {9 5 )-i y {e k )). (32) 

Rule 2 can be implemented as follows. If 5 was accepted by Rule 1, 5 is increased at 
the next iteration in order to extend the region of validity of the model l y {9). If 5 
was rejected, the region must be tightened and 5 is decreased at the next iteration. 

The Trust Region strategy implemented here is essentially the same as that pro- 
posed by More [32] . 

Algorithm 1 Step 0. (Initialization) Set 9° £ R p ; So > and the "curve search" 
parameters m, m! with < m < m' < 1. 

Step 1. With l y {9) the quadratic approximation l2§\) , solve 

9 5k = argmaxggRpf^) 

subject to 

\\9 - 9% k < 5 k . 

Step 2. Ifl y {9 &k ) - l y (9 k ) > m{i y {9 6 ") - l y {9 k )) then set 9 k+1 = 9 s ". Otherwise, 
set 9 k+1 = 9 k .' 

Step 3. Set k = k + 1. Update the model l y (9 k ). Update 5k using Procedure^ 
Step 4- Go to Step 1. 

The procedure for updating 5k is given below. 

Procedure 1 Step 0. (Initialization) Set 71 and 72 such that 71 < 1 < 72- 
Step 1. Ifl y {9 Sk ) - l y {9 k ) < m(l y (9 5k ) - l y (9 k )) then take 5 k+ i G (0,7i^). 



16 



Step 2. Ifly{e Sk ) - l y {6 k ) < m'(i y (6 5 k) - l y {9 k )) then take 5 k+1 G (jiS k ,S k ). 
Step 3. Ifl y {6 Sk ) - l y (6 k ) > m'(i y {6 Sk ) - l y {9 k )) then take 5 k+1 G (S k ,j 2 S k ). 

The Trust Region algorithm satisfies the following convergence theorem 

Theorem 3 Let g(y; 9) and k(x\y; 9) be such that Assumptions 1 are satisfied. Then, 
{0 k } generated by Algorithm^ converges to the maximizer Oml of the log-likelihood 
l y {9) and satisfies the monotone likelihood property l y (9 k+1 ) > l y (9 k ). If in addi- 
tion, the sequence of Lagrange multipliers {/3 k } tends towards zero, {0 k } converges 
superlinearly. 

The proof of Theorem [3] is omitted since it is standard in the analysis of Trust 
Region methods; see [521 [M]. Super linear convergence for the case that lim^oo (3 k = 
follows from the Dennis and More criterion [3j Theorem 3.11]. 



5.4 Discussion 

The convergence results of Theorems 1 and 2 apply to any class of objective functions 
which satisfy the Assumptions [TJ For instance, the analysis directly applies to the 
penalized maximum likelihood (or posterior likelihood) objective function l y {9) = 
l y {9) +p{0) when the ML penalty function (prior) p{9) is quadratic and non- negative 
of the form p{9) = {9- 9 Q ) T R(9 - 9 Q ), where R is a non-negative definite matrix. 

The convergence Theorems 1 and 2 make use of concavity of l y {9) and convexity 
of I y (9,9) via Assumptions (TJiii and[TJiv. However, for smooth non-convex functions 
an analogous local superlinear convergence result can be established under some- 
what stronger assumptions similar to those used in [15]. Likewise the Trust Region 
framework can also be applied to nonconvex objective functions. In this case, global 
convergence to a local maximizer of l y {9) can be established under Assumptions [Hi, 
[TJ ii and[TJiv following the proof technique of |32j. 



6 Application to Poisson data 

In this section, we illustrate the application of Algorithm [1] for a maximum likelihood 
estimation problem in a Poisson inverse problem arising in radiography, thermionic 
emission processes, photo-detection, and positron emission tomography (PET). 
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6.1 The Poisson Inverse Problem 



The objective is to estimate the intensity vector 9 = [9\, . . . , 6 P ] T governing the num- 
ber of gamma-ray emissions ./V = [N\, . . . , N p ] T over an imaging volume of p pixels. 
The estimate of 9 must be based on a vector of m observed projections of N denoted 
Y = [Yi, ... ,Y m ] T . The components iVj of N are independent Poisson distributed 
with rate parameters 9i, and the components Yj of Y are independent Poisson dis- 
tributed with rate parameters X^=i -^'^j w bere Pji is the transition probability; 
the probability that an emission from pixel i is detected at detector module j. The 
standard choice of complete data X, introduced by Shepp and Vardi [39], for the EM 
algorithm is the set {Nji}i<j< m , i<i<p-, where Nji denotes the number of emissions 
in pixel i which are detected at detector j. The corresponding many-to-one mapping 
h(X) = Y in the EM algorithm is 

p 

Y 3=J2 N ^ l <j< m - ( 33 ) 
i=l 

It is also well known |39] that the likelihood function is given by 

m p p 

log g(y; 0) = £ ( E P3&) " Vj log ( ^ PjiOi) + log Vj \ (34) 
j=l i=l i=l 

and that the expectation step of the EM algorithm is (see |13| ) 

Q(9,9) = E[logf(x;9)\y;9}= (35) 

m p 13 a 

EE(yf^V log(p ^ ) - p ^)- 

j=l i=l x L,i=i r ii°i 
Let us make the following additional assumptions: 

• the solution(s) of the Poisson inverse problem is (are) positive 

• the level set 

C = {9 £ R n | l y {9) > l y {9 1 )} (36) 
is bounded and included in the positive orthant. 

Then, since l y is continuous, C is compact. Due to the monotonicity property of 
{9 k }, we thus deduce that for all k, 9\ > 7 for some 7 > 0. Then, the likelihood 
function and the regularization function are both twice continuously differentiable on 
the closure of {9 k } and the theory developed in this paper applies. These assumptions 
are very close in spirit to the assumptions in Hero and Fessler |15j , except that we do 
not require the maximizer to be unique. The study of KPP without these assumptions 
requires further analysis and is addressed in [6]. 
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6.2 Simulation results 



For illustration we performed numerical optimization for a simple one dimensional 
deblurring example under the Poisson noise model of the previous section. This 
example easily generalizes to more general 2 and 3 dimensional Poisson deblurring, 
tomographic reconstruction, and other imaging applications. The true source is 
a two rail phantom shown in Figure HJ The blurring kernel is a Gaussian function 
yielding the blurred phantom shown in Figure [2j We implemented both EM and KPP 
with Trust Region update strategy for deblurring Fig. [2] when the set of ideal blurred 
data Yi = Y2j=i ls available without Poisson noise. In this simple noiseless case 
the ML solution is equal to the true source 9 which is everywhere positive. Treatment 
of this noiseless case allows us to investigate the behavior of the algorithms in the 
asymptotic high count rate regime. More extensive simulations with Poisson noise 
will be presented elsewhere. 

The numerical results shown in Fig. [3] indicate that the Trust Region imple- 
mentation of the KPP algorithm enjoys significantly faster convergence towards the 
optimum than does EM. For these simulations the Trust Region technique was imple- 
mented in the standard manner where the trust region size sequence 5k in Algorithm 
1 is determined implicitly by the j3k update rule: Pk+i = l-6At {$k 1S decreased) and 
otherwise fik+i = 0.5/3^ (<5fc is increased). The results shown in Fig. [4] validate the 
theoretical superlinear convergence of the Trust Region iterates as contrasted with 
the linear convergence rate of the EM iterates. Figure [5] shows the reconstructed pro- 
file and demonstrates that the Trust Region updated KPP technique achieves better 
reconstruction of the original phantom for a fixed number of iterations. Finally, 
Figure [6] shows the iterates for the reconstructed phantom, plotted as a function of 
iteration on the horizontal axis and as a function of grey level on the vertical axis. 
Observe that the KPP achieves more rapid separation of the two components in the 
phantom than does standard EM. 



7 Conclusions 



The main contributions of this paper are the following. First, we introduced a general 
class of iterative methods for ML estimation based on Kullback-Liebler relaxation of 
the proximal point strategy. Next, we proved that the EM algorithm belongs to the 
proposed class, thus providing a new and useful interpretation of the EM approach 
for ML estimation. Finally, we showed that Kullback proximal point methods enjoy 
global convergence and even superlinear convergence for sequences of positive relax- 
ation parameters that converge to zero. Implementation issues were also discussed 
and we proposed second order schemes for the case where the maximization step is 
hard to obtain in closed form. We addressed Trust Region methodologies for the 
updating of the relaxation parameters. Computational experiments indicated that 
the approximate second order KPP is stable and verifies the superlinear convergence 
property as was predicted by our analysis. 
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Figure 1: Two rail phantom for ID deblurring example. 
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Figure 2: Blurred two level phantom. Blurring kernel is Gaussian with standard width 
approximately equal to rail separation distance in phantom. An additive randoms noise of 
0.3 was added. 
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Figure 3: Snapshot of log-Likelihood vs iteration for plain EM and KPP EM algorithm. 
Plain EM initially produces greater increases in likelihood function but is overtaken by 
KPP EM at 7 iterations and thereafter. 
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Figure 6: Evolution of the reconstructed source vs iteration for plain EM and KPP EM. 
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